ABSTRACT
Fault tolerance techniques enable systems to continue carrying out tasks in the presence of faults. A checkpoint is a local
state of a process saved on stable storage. In a distributed system, the processes do not share
memory, so a global state of the system is defined as a combination of local states, one from each process. In case of a
fault in a distributed system, checkpointing enables the execution of a program to be resumed from a previous
consistent global state rather than from the beginning. In this way, the amount of
useful computation lost because of the fault is appreciably reduced. In this paper, we discuss various
issues related to checkpointing for distributed systems and mobile computing environments. We also review
the main types of checkpointing: coordinated checkpointing, asynchronous checkpointing, communication-induced
checkpointing, and message-logging-based checkpointing. Finally, we present a survey of some checkpointing
algorithms for distributed systems.
Keywords: Checkpointing algorithms; parallel and distributed computing; rollback recovery; fault-tolerant
systems.